[FLINK-37730][Job Manager] Expose JM exception as K8s exceptions #978

vsantwana · 2025-05-06T17:44:13Z

What is the purpose of the change

This pull request adds a new feature that periodically checks for job exceptions through the Flink REST API and raises them as Kubernetes events. This is helpful for monitoring systems that want to post-process job exceptions.

This is ONLY introduced for Application mode and NOT Session mode.

Brief change log

Periodic pulling of job exceptions (only done when the job manager is NOT in a terminal state)
New SYSTEM_ADVANCED config options for the maximum number of exceptions reported and the maximum stack trace length (defaults are 5 and 10, respectively)
Most of the information is in the k8s event
Introduced new utility method to trigger the event with annotations
Introduced new test method to help testing the configuration

Verifying this change

This change added tests and can be verified as follows:

Apart from the unit tests in this PR, this was tested using two simulations:
sql-test: Good running job, exception simulated by manually killing the TM
sql-test-failing: Job that has exception in the open method, repeatedly fails.

Both exceptions were produced one at a time and also simultaneously.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changes to the CustomResourceDescriptors: (yes / no)
Core observer or reconciler logic that is regularly executed: Yes, this changes the observer.

Documentation

Does this pull request introduce a new feature? Yes
If yes, how is the feature documented? I have to figure out the documentation part

rmetzger

I like this draft a lot.

@gyfora can you take a quick look as well, or would you prefer this to be fully ready before you take a look?

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

gyfora · 2025-05-06T20:20:54Z

I like this draft a lot.

@gyfora can you take a quick look as well, or would you prefer this to be fully ready before you take a look?

Thank you!
I will try to take a look in the next 1-2 days :)

morhidi · 2025-05-06T20:38:33Z

I like this draft a lot.
@gyfora can you take a quick look as well, or would you prefer this to be fully ready before you take a look?

Thank you! I will try to take a look in the next 1-2 days :)

It'd be great to catch and turn every job exception into a k8s event, not just for terminal job failures. It'd simplify collecting historical diagnostic data before an actual crash occurs.

vsantwana · 2025-05-07T09:12:37Z

It'd be great to catch and turn every job exception into a k8s event, not just for terminal job failures. It'd simplify collecting historical diagnostic data before an actual crash occurs.

@morhidi Sorry I do not understand this. I am not checking for only terminal job failures. I am checking for all the failures, when the job is not in one of the terminal states.
Lmk if I have misunderstood your comment.

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

...ernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/EventRecorder.java

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

morhidi · 2025-05-07T12:22:12Z

It'd be great to catch and turn every job exception into a k8s event, not just for terminal job failures. It'd simplify collecting historical diagnostic data before an actual crash occurs.

@morhidi Sorry I do not understand this. I am not checking for only terminal job failures. I am checking for all the failures, when the job is not in one of the terminal states. Lmk if I have misunderstood your comment.

nm, I misread it at first glance

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

...api/src/main/java/org/apache/flink/kubernetes/operator/api/status/FlinkDeploymentStatus.java

...c/main/java/org/apache/flink/kubernetes/operator/config/KubernetesOperatorConfigOptions.java

.../main/java/org/apache/flink/kubernetes/operator/observer/deployment/ApplicationObserver.java

helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml

.../main/java/org/apache/flink/kubernetes/operator/observer/deployment/ApplicationObserver.java

rmetzger · 2025-05-16T10:06:40Z

...kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/EventUtils.java

    public static boolean createIfNotExists(
            KubernetesClient client,
            HasMetadata target,


Why can't we call createWithAnnotationsIfNotExists() from this method to avoid code duplication?

vsantwana · 2025-05-19

I had thought about it, but I did not do it because of the event time. In our case we decided to set the exception time as the event time, but I am not aware of how it should be set for other k8s events, so I kept the methods separate at the cost of duplicated code.
cc @gyfora
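
One way the two code paths could share an implementation while keeping the event-time concern explicit is to let the simpler method delegate with no annotations and no explicit event time. The following is only a minimal, self-contained sketch of that idea, not the operator's code: `EventSketch`, its nested `Event` class, the in-memory `store`, and both method signatures are hypothetical stand-ins for the fabric8-based methods in `EventUtils`.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class EventSketch {
    /** Hypothetical, simplified stand-in for a Kubernetes event. */
    static final class Event {
        final String message;
        final Map<String, String> annotations;
        final Instant eventTime;

        Event(String message, Map<String, String> annotations, Instant eventTime) {
            this.message = message;
            this.annotations = annotations;
            this.eventTime = eventTime;
        }
    }

    // In-memory stand-in for the API server, keyed by event name.
    static final Map<String, Event> store = new HashMap<>();

    /** Simple variant: delegates with no annotations and no explicit event time. */
    static boolean createIfNotExists(String name, String message) {
        return createWithAnnotationsIfNotExists(name, message, Map.of(), null);
    }

    /** Returns true if a new event was created, false if one already existed. */
    static boolean createWithAnnotationsIfNotExists(
            String name, String message, Map<String, String> annotations, Instant eventTime) {
        if (store.containsKey(name)) {
            return false;
        }
        // Assumption: without an explicit event time, fall back to the current time.
        Instant time = eventTime != null ? eventTime : Instant.now();
        store.put(name, new Event(message, Map.copyOf(annotations), time));
        return true;
    }

    public static void main(String[] args) {
        System.out.println(createIfNotExists("generic", "plain event"));
        System.out.println(
                createWithAnnotationsIfNotExists(
                        "job-exception",
                        "exception event",
                        Map.of("key", "value"),
                        Instant.parse("2025-05-21T15:41:41.602Z")));
        System.out.println(createIfNotExists("generic", "duplicate"));
    }
}
```

With this shape, only callers that care about the exception time pass it, and all other events keep whatever default the shared method chooses.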

...ator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkResourceContext.java

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

.../src/main/java/org/apache/flink/kubernetes/operator/service/FlinkResourceContextFactory.java

gyfora

I think this is looking pretty good now, I added a few minor comments still

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

...perator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java

.../src/main/java/org/apache/flink/kubernetes/operator/service/FlinkResourceContextFactory.java

...rator/src/test/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserverTest.java

gyfora

Looks good! As a future followup we could think about reducing the number of REST API calls we make to fetch exceptions.

At the moment this is done on every step, but based on the job details that we get in previous observer steps we may be able to deduce that the job did not fail since the last time we checked, so exceptions would not need to be queried.

If you could open a follow up ticket for that I think that would be nice :)

gyfora · 2025-05-21T13:17:54Z

You need to regenerate the docs: mvn clean install -DskipTests -Pgenerate-doc

gyfora · 2025-05-21T15:44:43Z

I hit the following error while running locally:

2025-05-21 15:42:22,998 o.a.f.k.o.o.JobStatusObserver  [WARN ][default/basic-example] Failed to fetch JobManager exception info.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1:443/api/v1/namespaces/default/events. Message: Event in version "v1" cannot be handled as a Event: parsing time "2025-05-21T15:41:41.602Z[UTC]" as "2006-01-02T15:04:05.000000Z07:00": cannot parse ".602Z[UTC]" as ".000000". Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=Event in version "v1" cannot be handled as a Event: parsing time "2025-05-21T15:41:41.602Z[UTC]" as "2006-01-02T15:04:05.000000Z07:00": cannot parse ".602Z[UTC]" as ".000000", metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=BadRequest, status=Failure, additionalProperties={}).

So something seems to be off with the time handling
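
The rejected value `2025-05-21T15:41:41.602Z[UTC]` has the shape that `ZonedDateTime.toString()` produces, while the layout quoted in the error message asks for six fractional digits followed by an offset. As a hedged illustration (the class name `MicroTimeSketch` and the helper `toMicroTime` are made up for this example and are not part of the PR), a timestamp in the expected shape can be produced with a `DateTimeFormatter`:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class MicroTimeSketch {
    // Six fractional digits plus an offset ("Z" for UTC), matching the layout in the error message.
    static final DateTimeFormatter MICRO_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")
                    .withZone(ZoneOffset.UTC);

    static String toMicroTime(Instant instant) {
        return MICRO_TIME.format(instant);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2025-05-21T15:41:41.602Z");
        // toString() on a ZonedDateTime with a region zone appends the zone id in brackets.
        System.out.println(ZonedDateTime.ofInstant(ts, ZoneId.of("UTC"))); // 2025-05-21T15:41:41.602Z[UTC]
        System.out.println(toMicroTime(ts)); // 2025-05-21T15:41:41.602000Z
    }
}
```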

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

gyfora · 2025-05-22T08:21:28Z

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

+                }
+            }
+            ctx.getExceptionCacheEntry().setJobId(currentJobId);
+            ctx.getExceptionCacheEntry().setLastTimestamp(now.toEpochMilli());


Shouldn't this be the max exception timestamp? Job exceptions could occur between fetching them from the REST API and emitting them here, and those would be missed if we set a higher timestamp based on now.
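
The suggestion can be illustrated with a small, self-contained sketch. It assumes a simplified exception record with a name and an epoch-millisecond timestamp; `ExceptionCursorSketch`, `JobException`, and `advanceCursor` are hypothetical names and not the operator's actual types. The cursor is advanced to the largest timestamp that was actually seen rather than to the current time, so an exception raised after the REST call but before "now" is still picked up on the next pass.

```java
import java.util.List;

public class ExceptionCursorSketch {
    /** Hypothetical stand-in for one entry of the job exception history. */
    static final class JobException {
        final String name;
        final long timestamp; // epoch milliseconds

        JobException(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    /**
     * Returns the new cursor: the largest timestamp among exceptions newer than
     * lastTimestamp, or lastTimestamp itself if nothing new was fetched.
     */
    static long advanceCursor(List<JobException> fetched, long lastTimestamp) {
        long max = lastTimestamp;
        for (JobException e : fetched) {
            if (e.timestamp > lastTimestamp) {
                // A Kubernetes event for e would be emitted at this point.
                max = Math.max(max, e.timestamp);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<JobException> fetched =
                List.of(new JobException("old", 100L), new JobException("new", 250L));
        System.out.println(advanceCursor(fetched, 150L)); // 250
    }
}
```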

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

gyfora · 2025-05-23T07:23:51Z

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

+        String stacktrace = exception.getStacktrace();
+        if (stacktrace != null && !stacktrace.isBlank()) {
+            String[] lines = stacktrace.split("\n");
+            eventMessage.append("\n\nStacktrace (truncated):\n");


Can we maybe remove this line completely? It seems to just lengthen the messages and add an empty line.

It also seems like the first line with the exception name is basically duplicated by the stack trace.
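
A possible shape for the message builder that addresses both points is sketched below. It is an illustration under assumptions, not the code that was merged: `StacktraceSketch` and `buildMessage` are hypothetical names, and the rule of skipping the first stack trace line when it equals the exception name is an assumption about what "duplicated" means here.

```java
public class StacktraceSketch {
    /**
     * Builds an event message from the exception name followed by at most
     * maxLines stack trace lines, without a separate header or blank line.
     */
    static String buildMessage(String exceptionName, String stacktrace, int maxLines) {
        StringBuilder message = new StringBuilder(exceptionName);
        if (stacktrace == null || stacktrace.isBlank()) {
            return message.toString();
        }
        String[] lines = stacktrace.split("\n");
        int appended = 0;
        for (int i = 0; i < lines.length && appended < maxLines; i++) {
            // Assumption: the first line often repeats the exception name, so skip it then.
            if (i == 0 && lines[i].trim().equals(exceptionName)) {
                continue;
            }
            message.append("\n").append(lines[i]);
            appended++;
        }
        return message.toString();
    }

    public static void main(String[] args) {
        String trace =
                "java.lang.RuntimeException: boom\n"
                        + "\tat A.a(A.java:1)\n"
                        + "\tat B.b(B.java:2)\n"
                        + "\tat C.c(C.java:3)";
        System.out.println(buildMessage("java.lang.RuntimeException: boom", trace, 2));
    }
}
```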

vsantwana added 2 commits May 6, 2025 23:11

[FLINK-37730][Job Manager] Expose JM exception as K8s exceptions

2ee8c55

[FLINK-37730][dependency] Removes unintended dependency

22b84b9

rmetzger reviewed May 6, 2025

gyfora reviewed May 7, 2025

gyfora reviewed May 7, 2025

[FLINK-37730][Observer] Introduces observer and related configuration

7d011c3

gyfora reviewed May 8, 2025

[FLINK-37730][Cache] Add cache to store last recorded exception time

c434722

gyfora requested changes May 8, 2025

vsantwana added 2 commits May 16, 2025 14:32

[FLINK-37730] Moves exception emitter to JobStatusObserver

3c3a1b5

[FLINK-37730] Remove unintended changes

6fdbf84

vsantwana marked this pull request as ready for review May 16, 2025 09:09

vsantwana requested review from gyfora, morhidi and rmetzger May 16, 2025 09:09

rmetzger reviewed May 16, 2025

gyfora requested changes May 16, 2025

[FLINK-37730][Review] Address comments

61ad2cd

vsantwana requested review from gyfora and rmetzger May 19, 2025 10:28

gyfora requested changes May 20, 2025

vsantwana added 2 commits May 20, 2025 21:42

Addresses review comments

f1e9320

Adds check for exceptions when prevState is terminal

4722465

vsantwana requested a review from gyfora May 20, 2025 17:10

gyfora approved these changes May 21, 2025

gyfora reviewed May 21, 2025

[FLINK-37730]Reverse the exception to get newer exceptions

5104e13

gyfora reviewed May 22, 2025

gyfora requested changes May 22, 2025

[FLINK-37730]Fixes the failures

8f1cab7

vsantwana requested a review from gyfora May 22, 2025 15:49

vsantwana added 2 commits May 22, 2025 21:28

Adds generated docs

1b5e654

Changes jobManagerDeployment to job

7414fc4

gyfora approved these changes May 23, 2025

gyfora reviewed May 23, 2025

vsantwana added 2 commits May 23, 2025 13:17

[FLINK-37730][Exception] Beautify Exception reporting in events

8c45e14

Updates test to match new code

0f64ad4

gyfora merged commit 0679d63 into apache:main May 23, 2025
130 checks passed

vsantwana mentioned this pull request Jul 2, 2025

[FLINK-37895][Job Manager] Fix failing collection of Flink Exceptions for Session Jobs #988

Merged
